Goto

Collaborating Authors

 policy network


Conditioning Matters: Training Diffusion Policies is Faster Than You Think

Neural Information Processing Systems

Diffusion policies have emerged as a mainstream paradigm for building visionlanguage-action (VLA) models. Although they demonstrate strong robot control capabilities, their training efficiency remains suboptimal. In this work, we identify a fundamental challenge in conditional diffusion policy training: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution, a phenomenon we term loss collapse. To overcome this, we propose Cocos, a simple yet general solution that modifies the source distribution in the conditional flow matching to be condition-dependent. By anchoring the source distribution around semantics extracted from condition inputs, Cocos encourages stronger condition integration and prevents the loss collapse. We provide theoretical justification and extensive empirical results across simulation and real-world benchmarks. Our method achieves faster convergence and higher success rates than existing approaches, matching the performance of large-scale pre-trained VLAs using significantly fewer gradient steps and parameters. Cocos is lightweight, easy to implement, and compatible with diverse policy architectures, offering a general-purpose improvement to diffusion policy training.



ASimple Decentralized Cross-Entropy Method

Neural Information Processing Systems

Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL) where a centralized approach is typically utilized to update the sampling distribution based on only the top-k operations' results on samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue, we propose Decentralized CEM (DecentCEM), a simple but effective improvement over classical CEM, by using an ensemble of CEM instances running independently from one another, and each performing a local improvement of its own sampling distribution. We provide both theoretical and empirical analysis to demonstrate the effectiveness of this simple decentralized approach. We empirically show that, compared to the classical centralized approach using either a single or even a mixture of Gaussian distributions, our DecentCEM finds the global optimum much more consistently thus improves the sample efficiency. Furthermore, we plug in our DecentCEM in the planning problem of MBRL, and evaluate our approach in several continuous control environments, with comparison to the stateof-art CEM based MBRL approaches (PETS and POPLIN). Results show sample efficiency improvement by simply replacing the classical CEM module with our DecentCEM module, while only sacrificing a reasonable amount of computational cost. Lastly, we conduct ablation studies for more in-depth analysis.


Supplementary Material

Neural Information Processing Systems

Then each deterministic NN in {ฯ€w,b | (w,b) Wฯ€}is safe if and only if the system of constraints ฮฆ(ฯ€,X0,Xu,) is not satisfiable. We prove the equivalent claim that there exists a weight vector (w,b) Wฯ€ for which ฯ€w,b is unsafe if and only if ฮฆ(ฯ€,X0,Xu,) is satisfiable. First, suppose that there exists a weight vector (w,b) Wฯ€ for which ฯ€w,b is unsafe and we want to show that ฮฆ(ฯ€,X0,Xu,) is satisfiable. This direction of the proof is straightforward since values of the network's neurons on the unsafe input give rise to a solution of ฮฆ(ฯ€,X0,Xu,). Indeed, by assumption there exists a vector of input neuron values x0 X0 for which the corresponding vector of output neuron values xl = ฯ€w,b(x0) is unsafe, i.e. xl Xu.




example where multi step outperforms one step

Neural Information Processing Systems

As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, ฮฑ = 0.1for reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of optimal and uniform, we use a behavior that is a mixture of maximally suboptimal and uniform. If we call the suboptimal policy ฯ€ (which always goes down and left in our gridworld), then the behavior for the modified example is ฮฒ = 0.2 ฯ€ +0.8 u, where uis uniform. Results are shown in Figure 7. Figure 7: A gridworld example with modified behavior where multi-step is much better than one-step.


204904e461002b28511d5880e1c36a0f-Supplemental.pdf

Neural Information Processing Systems

Similarly to [6], we consider that all environments have the same underlying Structural Causal Model (SCM) and that the different environments correspond to different interventions on the SCM. We provide here the formal definition for SCMs and interventions. We say that Xi causes Xj if Xi 2Pa(Xj). Definition A.2. (Intervention) [6]: Consider a SCMC =( S,N). An intervention e on C consists of replacing one or several of its structural equations to obtain an intervened SCMCe =( Se,N e) with structural equations: Sej: Xej fj(Pa(Xej),N ej), for j =1,...m (11) The variable Xe is intervened on if Si 6= Sei or Ni 6= Nei .